Probability-Based Chinese Text Processing and Retrieval

نویسندگان

  • Xiangji Huang
  • Stephen E. Robertson
  • Nick Cercone
  • Aijun An
چکیده

We discuss the use of probability-based natural language processing for Chinese text retrieval. We focus on comparing different text extraction methods and probabilistic weighting methods. Several document processing methods and probabilistic weighting functions are presented. A number of experiments have been conducted on large standard text collections. We present the experimental results that compare a word-based text processing method with a character-based method. The experimental results also compare a number of term-weighting functions including both single-unit weighting and compound-unit weighting functions.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

ارائه یک روش جدید بازیابی اطلاعات مناسب برای متون حاصل از بازشناسی گفتار

In this article a pre-processing method is introduced which is applicable in speech recognized texts retrieval task. We have a text corpus, t generated from a speech recognition system and a query as inputs,  to search queries in these documents and find relevant documents. A basic problem in a typical speech recognized text is some error percentage in recognition. This, results erroneously ass...

متن کامل

An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese Text1

Put forward a new method about automatic Chinese text segmentation based on Chinese characters string (CCS) frequency and length descending. It can automatically segment meaningful CCS in text based on processing longer string first and string frequency information, with no thesaurus, no acquiring the probability between words in advance and no Chinese character index. This method can effective...

متن کامل

Discovering Chinese Words from Unsegmented Text

In English written text, words are separated by spaces, but in written Chinese text, there are no such separators between words. (See Figure 1.) Thus, effective information retrieval of Chinese text first requires good word segmentation. In this paper, we investigate an efficient algorithm to discover the words and their occurrence probabilities from a corpus of unsegmented text without using a...

متن کامل

An Approach Based on Iterative Learning Algorithm for Chinese Text Hierarchy Feature Extraction without Lexicon

A great deal of information included in Chinese text is invaluable asset for further text mining, but the difference between Chinese and the western languages imposes restrictions on further utilization of Chinese text. No distinction indication between words by using spaces is one of the major differences between Chinese, also some other Asian languages, such as Japanese, Thai, etc., and Weste...

متن کامل

Image retrieval using the combination of text-based and content-based algorithms

Image retrieval is an important research field which has received great attention in the last decades. In this paper, we present an approach for the image retrieval based on the combination of text-based and content-based features. For text-based features, keywords and for content-based features, color and texture features have been used. Query in this system contains some keywords and an input...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Computational Intelligence

دوره 16  شماره 

صفحات  -

تاریخ انتشار 2000